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Parallelizing  Locally- Weighted  Regression 

By 

Julia  Corbin  Fauntleroy 
and 

Edward  J.  Wegman 


Abstract: 

This  paper  focuses  on  a  nonparametric  regression  technique  known  as  locally-weighted 
regression  or  LOESS.  LOESS  is  a  computationally  intensive  technique  which  makes  it  naturally 
amenable  to  exploiting  high  performance  computers.  In  this  paper,  we  explore  domain 
decomposition  techniques  for  LOESS  and  study  the  performance  of  our  algorithm  on  an  Intel 
Paragon  XP/S  A4  machine.  We  study  both  speedup  and  efficiency  as  a  function  of  the  number 
of  nodes.  Certain  segments  of  the  LOESS  computation  are  shown  to  be  fruitfully  parallelized 
while  others  are  essentially  sequential  and  cannot  be  parallelized  effectively. 
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Introduction 

Regression  analysis  is  a  statistical  methodology  used  to  predict  values  of  one  or  more 
response  variables  from  a  group  of  predictor  variables.  This  methodology  results  in  the  following 
model 

Y  =  X/3  +  £,  with  0  =  (XTQ-’XTY 

where  Y  is  the  response  variables,  X  are  the  predictor  variables,  /3  is  the  vector  of  regression 
coefficients,  and  e  are  the  error  terms.  The  purpose  of  regression  analysis  is  to  find  the  estimate  of 
0  that  best  fit  the  data.  The  regression  coefficients  are  selected  using  a  least  squares  criterion  and 
referred  to  as  the  least  squares  estimates  of  0.[2] 

Locally-weighted  regression,  or  loess,  is  a  non-parametric  method  for  fitting  a  regression 
surface  using  multivariate  smoothing.  Instead  of  a  single  fit  over  all  X,  a  local  neighborhood  of  size 
k  is  determined  for  each  x,  and  a  weighted-regression  model  is  formed  for  that  neighborhood.  This 
model  is  similar  to  the  regression  model  and  can  be  expressed  as 

Y  =  XU  +  e,  where  0  =  (XrWiX>,XrW|Y  and 


W,= 

kxk 

W(  is  a  diagonal  matrix  of  weights  where  w y  is  the  weight  of  the  /th  observation  in  the 
neighborhood  of  x,. The  weights  are  determined  so  that  points  closer  to  x,  are  weighted  more  than 
those  further  away  from  x(  .  The  size  of  the  local  neighborhood  is  a  fraction  of  the  total  sample  size, 

n[  3] 


w, 


w. 


wt 
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To  calculate  a  loess  estimate,  n  regression  models  must  be  formed.  When  the  sample  size 
is  small,  the  algorithm  can  quickly  calculate  the  fitted  value  for  each  data  point.  When  the  sample 
size  becomes  large,  the  number  of  computations  performed  requires  an  extensive  amount  of 
computer  time.  This  paper  demonstrates  the  use  of  parallel  computing  to  reduce  the  time  for 
computing  fitted  regression  estimates  for  each  data  point  in  large-scale  problems. 

Constructing  a  Parallel  Algorithm 

The  code  for  the  local  regression  routine  was  obtained  from  AT&T  Bell  Labs  at 
netlib.att.com.  It  originated  from  research  by  William  S.  Cleveland,  Eric  Grosse,  and  Ming-Jen 
Shyu  and  is  the  loess()  routine  used  by  S-Plus,  a  statistical  software  package  from  StatSci.  The 
code  consists  of  both  C  and  Fortran  subroutines.  Only  the  portion  of  code  for  computing  exact 
regression  estimates  was  used  for  this  project. 

The  first  step  towards  parallelizing  the  problem  was  to  examine  the  code  in  a  purely 
sequential  setting.  The  program  was  split  into  several  sections.  As  illustrated  in  Figure  1,  each 
section  represents  either  sequential  processes  or  candidate  processes  for  parallelization.  The 
sequential  processes  handled  the  setup  and  cleanup  of  internal  working  storage,  computation  of 
residuals,  and  residual  diagnostics.  The  candidate  processes  were  the  computation  of  the  fitted 
regression  values  and  the  degrees  of  freedom  (or  effective  number  of  parameters).  These  portions 
of  code  perform  repetitive  computations,  require  a  longer  time  to  compute,  and  can  easily  be 
translated  to  a  parallel  environment.  Table  1  shows  the  breakdown  of  computation  time  for  the 
sequential  processing  of  the  complete  application  using  a  sample  size  of  500. 


Section  of  Application 

Time  (in  sec.) 

1.  Setup  of  working  storage 

0.6805 

2.  Compute  fitted  values 

4.1304 

3.  Compute  degrees  of  freedom 

111.2922 

4.  Compute  residuals,  residual  diagnostics,  cleanup 

0.6672 

Total  Time 

116.7703 

Table  1:  Breakdown  of  sequential  time  for  500  data  points. 


As  seen  in  the  table,  the  largest  computation  time  occurs  when  the  degrees  of  freedom  are 
computed.  In  Figure  2,  it  also  can  be  seen  that  the  time  required  for  computing  the  degrees  of 
freedom  increases  at  a  much  fester  rate  than  that  of  computing  the  fitted  values.  This  is  because  the 
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Figure  1:  Breakdown  of  Algorithm 


Sequential  Process 


Candidate  Process  #1 


Sequential  Process 


Candidate  Process  #2 


Sequential  Process 


Output 


Figure  2:  Computation  Time 
of  Candidate  Processes 


degrees  of  freedom  computation  requires  0(«2)  calculations.  The  number  of  calculations  for  this 
process  grows  at  a  rapid  rate  as  n  gets  large  (Figure  3).  When  this  process  is  parallelized  a 
significant  improvement  in  the  calculation  times  should  be  seen  as  the  calculations  are  spread 
across  the  nodes. 

On  the  other  hand,  the  computation  of  the  fitted  values  requires  a  much  smaller  number  of 
calculations.  This  may  be  surprising  since  a  regression  model  is  being  fitted  for  every  data  point, 
but  given  the  fact  that  the  model  is  fitting  only  a  small  number  of  points  in  the  local  neighborhood, 
the  number  of  calculations  is  of  0(nk)  where  nk  is  the  size  of  the  local  neighborhood.  This  keeps  the 
computation  time  small.  For  purposes  of  this  paper,  both  the  processes  will  be  moved  to  a  parallel 
environment.  However,  the  benefit  in  parallelizing  the  computation  of  fitted  values  will  be  small. 

After  the  candidate  processes  were  determined,  each  section  of  code  was  examined  to 
decide  the  best  approach  to  take  in  parallelizing  the  routines.  As  the  outlines  show,  the  data  are 
processed  in  a  sequential  manner  in  both  candidate  processes;  i.e.  a  loop  from  1  to  n.  In  addition, 
no  information  from  a  calculation  in  one  iteration  of  the  loop  is  required  at  the  next  iteration.  This 
means  that  the  calculations  can  be  performed  on  separate  nodes. 

Candidate  Process  #1  -  Compute  Fitted  Values 

The  loess  estimate  for  a  point  Xq  is  computed  in  the  following  way.  [5] 

1 .  Identify  the  Jt-nearest  neighbors  of  x^  for  k  =  N*span  where  the  span  is  percentage  of 

the  data  provided  by  the  user. 

2.  Compute  the  maximum  distance  of  the  £-nearest-neighbors  and  Xq. 

3.  Compute  the  weights  for  the  £-nearest-neighbors. 

4.  Fit  the  k-nearest-neighbors  using  weighted  least-squares. 

Candidate  Process  #2  -  Compute  degrees  of  freedom 

Using  Hat  Matrix  (H)  compute 

1.  Hq  =  I  -  H  where  I  is  the  Identity  matrix 
2.6,  =  tr(H0'H(/ 

3.  p  =  <5[2/62  where  p  is  the  degrees  of  freedom  [6] 

There  are  several  ways  to  parallelize  these  algorithms.  The  calculations  for  each  data  point 
requires  the  multiplication  of  several  matrices.  The  matrix  algebra  could  be  vectorized  but  the  time 
required  for  the  matrix  calculations  at  each  data  point  is  minimal  and  the  communication  overhead 
for  number  of  calculations  required  would  probably  increase  the  computation  time.  Since 
vectorization  is  not  practical,  the  best  approach  would  be  to  distribute  the  data  across  the  nodes  if 
the  data  set  is  extremely  large. 
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Figure  3:  Calculations  vs  Time 
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Implementation  of  the  Algorithm 


To  determine  the  computation  time  of  the  parallelized  routines,  five  samples  of  100,  200, 
300,  400,  and  500  data  points  with  two  parameters  were  used.  These  points  were  generated  from 
random  variables  where  Xi  ~  N(0,1),  X2  ~  E(l),  andY  ~  N(0,1).  Each  sample  was  processed 
across  N nodes  for  A=l,2,..,56.  The  timings  that  were  collected  during  the  runs  were  based  on 
elapsed  system  time.  The  candidate  processes  to  be  parallelized  are  the  computation  of  fitted 
regression  values  and  the  computation  of  degrees  of  freedom. 

Each  candidate  process  w  as  evaluated  to  determine  the  speedup  and  efficiency  due  to  the 
parallelization.  The  speedup  (S)  of  the  program  is  defined  as  the  ratio  of  the  time  it  takes  for  the 
application  to  execute  on  a  single  node  (Tj)  and  the  time  it  takes  for  the  application  to  execute  over 
p  nodes  (T^).  This  says  that  the  ideal  time  for  each  node  to  process  is  T ,/p,  which  implies  that 
Sideai  =  Tj/Tp  =  T,  /  (T ,//?)  =  p  The  speedup  is  effected  by  algorithmic  constructs,  overhead 
created  by  node  initialization  and  implementation,  load  balancing,  and  communication  overhead. 
The  efficiency  of  the  application  is  determined  by  the  ratio  of  the  speedup  to  the  number  of  nodes 
used;  e  =  />'//>.[  1]  Efficiency  describes  how  well  the  algorithm  was  parallelized.  The  more  work  a 
single  processor  has  to  do  because  of  the  parallelization,  the  lower  the  efficiency.  [4]  If  ideal 
speedup  is  achieved  then  the  efficiency  of  the  application  is  1 .0.  Since  the  ideal  speedup  will  most 
likely  not  be  achieved,  there  becomes  a  tradeoff  between  loss  of  efficiency  and  increased 
speedupf  1] 

Results 

As  illustrated  in  Figure  4,  the  computation  of  fitted  values  showed  some  improvement  as 
the  number  of  nodes  increased  but  after  the  addition  of  more  than  four  nodes,  the  speedup  of  the 
process  began  to  decline.  Eventually,  running  the  process  across  a  large  number  of  nodes  resulted 
in  worse  times  than  running  it  on  a  single  node.  It  is  suspected  that  the  communication  overhead  is 
having  an  effect  on  the  computation  time.  It  is  interesting  to  note  that  as  the  sample  size  increases 
the  speedup  decreases  over  all  nodes.  This  indicates  that  this  process  has  better  performance  in  a 
sequential  environment. 

On  the  other  hand,  the  computation  of  the  degrees  of  freedom  showed  a  dramatic  had  a 
dramatic  improvement  in  speedup,  as  seen  in  Figure  5.  But  like  the  first  process,  the  speedup 
begins  to  drop  off  as  the  number  of  nodes  increases  past  some  maximum  speedup  value.  As  the 
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Figure  4:  Speedup 
Computed  Regression  Values 


Figure  5:  Speedup 
Degrees  of  Freedom 


figure  also  shows,  when  the  sample  size  increases,  the  number  of  nodes  required  for  the  maximum 
speedup  also  increases. 

While  the  speedup  of  the  process  has  been  greatly  improved,  at  no  time  does  it  reach  its 
ideal  value.  Since  the  code  was  not  optimized  for  a  parallel  environment,  there  is  probably  some 
overhead  in  the  software  that  effects  the  speedup.  In  addition,  there  is  certain  to  be  communication 
overhead  as  the  number  of  nodes  increases. 

In  the  first  approach  at  parallelizing  the  second  algorithm,  the  times  seemed  to  have  a  lot 
of  noise  in  them  as  the  number  of  nodes  increased.  After  examining  the  data  it  became  apparent 
that  high  speedup  values  occurred  when  the  data  was  evenly  distributed  across  the  nodes.  Low 
speedup  values  occurred,  when  n-  \  nlp\  ^  0.  A  review  of  the  code,  pointed  to  a  load  balancing 
problem.  The  distribution  of  data  to  the  nodes  was  changed  so  that  a  node  received  either  [nip]  or 
[nip]  +  1  data  points.  This  lessened  the  impact  that  load  balance  had  on  speedup.  The  results  of 
this  modification  can  be  seen  in  Figure  6. 

If  we  look  at  efficiency,  in  Figures  7  &  8,  it  is  easily  seen  that  efficiency  drops  off  fairly 
quickly  when  more  nodes  are  added.  However,  as  the  sample  size  increases,  the  efficiency  tends  to 
increase  overall  the  nodes.  This  means  that  while  we  do  get  a  significant  speedup  using  a  large 
number  of  nodes,  the  algorithm  is  not  performing  as  well  as  expected  in  the  parallel  environment. 
The  algorithm  was  not  optimized  for  parallelization  and  this  may  have  an  impact  on  the  efficiency. 
Figure  9  illustrates  the  interaction  of  efficiency  and  speedup.  The  point  where  the  efficiency  begins 
to  fall  below  the  speedup  is  where  the  interaction  of  the  system  starts  to  have  an  effect  on  speedup. 
This  effect  may  be  a  result  of  how  the  algorithm  is  processed  by  the  node,  the  quantity  of  data 
passed,  or  some  other  system  dependent  factor. [4]  An  interesting  phenonmenon  occurs  when 
comparing  the  crossover  points  of  each  sample  size  (see  Figure  10).  The  points  where  speedup  and 
efficiency  lines  cross  for  each  sample  size  appear  at  exactly  the  same  node  for  each  sample  size. 
This  indicates  that  there  is  some  system  interaction  with  the  algorithm  that  is  not  data  dependent. 
Additional  research  is  required  to  determine  the  cause  of  this  system  interaction. 

Conclusions 

The  results  of  this  study  show  that  parallelizing  the  computation  of  the  degrees  of  freedom 
significantly  improves  the  performance  of  the  application  for  large  sample  sizes.  The  computation 
of  the  fitted  regression  values  has  little  if  any  improvement  and  should  probably  be  left  as  a 
sequential  process.  This  problem  was  a  good  example  for  showing  that  the  distribution  of  the 
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Load  Balancing  -  Sample  size  500 

(#)  -  size  of  node  imbalance 


Balanced  Unbalanced 


Figure  7:  Efficiency 
Computed  Regression  Values 


Figure  8:  Efficiency 
Degrees  of  Freedom 


Number  of  Nodes 


computations  across  the  nodes  can  effectively  reduce  the  amount  of  computation  time.  In  addition, 
it  demonstrated  the  impact  of  load  balancing  on  the  speedup  and  efficiency  of  an  application.  A 
maximum  speedup  value  occurs  at  a  particular  number  of  nodes  used.  Futhermore,  an  increase  in 
the  number  of  nodes  may  not  always  improve  the  performance  of  the  application.  While  the 
samples  sizes  used  in  the  study  were  relatively  small,  the  increase  in  speedup  over  the  selected 
sample  sizes  seems  to  indicate  that  for  sample  sizes  of  1000  or  10000  there  will  still  be  a 
significant  increase  in  the  speedup  of  the  computations. 
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cannot  be  parallelized  effectively. 


14.  SUBJECT  TERMS 

LOESS,  nonparametric  regression,  speedup,  efficiency 
load  balancing. 


15.  NUMBER  OF  PAGES 

18 


16.  PRICE  COOE 


17.  SECURITY  CLASSIFICATION 
OF  REPORT 

UNCLASSIFIED 


18.  SECURITY  CLASSIFICATION 
OF  THIS  PAGE 

UNCLASSIFIED 


19.  SECURITY  CLASSIFICATION 
OF  ABSTRACT 

UNCLASSIFIED 


20.  LIMITATION  OF  ABSTRACT 

UL 


NSN  7540-01-280-5500 


Standard  Form  298  (Rev  2-89) 

Prescribed  by  ANSI  Std  239-18 


MASTER  COPY 


KEEP  THIS  COPY  FOR  REPRODUCTION  PURPOSES 


REPORT  DOCUMENTATION  PAGE 


form  Approved 
OMB  NO  0704-0188 


Public  rcooninq  bufdnn  <0'  itw  roMtction  of  information  n  fstimatfd  to  jveragr  '  hour  o*'  response,  including  the  time  for  reviewing  instructions,  searching  enstmg  data  sources, 
oaihennq  and  maintaining  the  data  needed  and  completing  and  reviewing  the  collection  of  information  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  mo 
collection  of  information  including  suggestions  lor  reducing  this  burden  to  Washington  Headquarters  Services.  Directorate  for  information  Operations  and  Reports.  1 1 '  S  Jefferson 
Davis  Highway  Suite  004  Arlington,  VA  /J20/-410/  and  to  the  Office  of  Management  and  Budget  Paperwork  Reduction  Proiect  (0/04  0  188).  Washington.  DC  J0503 


1.  AGENCY  USE  ONLY  (Leave  blank ) 


t.  REPORT  DATE 

October,  1994 


4.  TITLE  AND  SUBTITLE 


3.  REPORT  TYPE  AND  DATES  COVERED 

Technical 


5.  FUNDING  NUMBERS 


Parallelizing  Locally-Weighted  Regression 


N00014-92-J-1303 


6.  AUTHOR(S) 


N0001 4-93-1-0527 


Julia  Corbin  Fauntleroy  and  Edward  J.  Wegman 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  AODRESS(ES) 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


Center  for  Computational  Statistics 
MS  4A7 

George  Mason  University 
Fairfax,  VA  22030-4444 


9.  SPONSORING  /MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 


TR  No. 


Department  of  the  Navy 
Office  of  the  Chief  of  Naval  Research 

Mathematical  Sciences  Division 
800  N.  Quincy  Street  Code  111 ISP 
Arlington.  VA  22217-5 


10.  SPONSORING /MONITORING 
AGENCY  REPORT  NUMBER 


11.  SUPPLEMENTARY  NOTES 

The  views,  opinions  and/or  findings  contained  in  this  report  are  those  of  the 
author (s)  and  should  not  be  construed  as  an  official  Department  of  the  Navy 
position,  policy,  or  decision,  unless  so  designated  by  other  documentation. 


12a.  DISTRIBUTION /AVAILABILITY  STATEMENT  I  12b.  DISTRIBUTION  CODE 


Approved  for  public  release;  distribution  unlimited. 


13.  ABSTRACT  (Maximum  200  words) 


This  paper  focuses  on  a  nonparametric  regression  technique  -known  as  locally-weighted 
regression  or  LOESS.  LOESS  is  a  computationally  intensive  technique  which  makes  it 
naturally  amenable  to  exploiting  high  performance  computers.  In  this  paper,  we 
explore  domain  decomposition  techniques  for  LOESS  and  study  the  performance  of  our 
algorithm  on  an  Intel  Paragon  XP/S  A4  machine.  We  study  both  speedup  and  efficiency 
as  a  function  of  thenumber  of  nodes.  Certain  segments  of  the  LOESS  computation  are 
shown  to  be  fruitfully  parallelized  while  others  are  essentially  sequential  and  cannot 
be  parallelized  effectively. 


14.  SUBJECT  TERMS 


LOESS,  nonparametric  regression,  speedup,  efficiency,  load 
balancing 


15.  NUMBER  OF  RAGES 

18 


16.  PRICE  CODE 


17.  SECURITY  CLASSIFICATION  18.  SECURITY  CLASSIFICATION  19.  SECURITY  CLASSIFICATION  I  20.  LIMITATION  OF  ABSTRACT 
OF  REPORT  OF  THIS  PAGE  OF  ABSTRACT 

UNCLASSIFIED  UNCLASSIFIED  UNCLASSIFIED  TTT 


UNCLASSIFIED 


NSN  7540-01 -280-5500 


Standard  Form  298  (Rev  2-89) 

Prescribed  bv  ANSI  Md 


